Inference for a proportion

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Learning Outcomes

  • Quantifying the uncertainty of the sample proportion
  • How to construct and interpret a confidence interval for the population proportion
  • How to conduct and interpret a hypothesis test for the population proportion

p from one sample

CS 1.5 revisited: Wordle

A snapshot of Wordle guess distributions from David and his Wordle-obsessed friends.

Variables
Count: An integer denoting the frequency of Guesses
Initials: A factor with 5 levels denoting whose Wordle guess distribution it is
Guesses: A factor with 7 levels denoting how many guesses it took to complete the daily Wordle (you lose if your 6th guess is incorrect)

# barchart() comes from the lattice package
library(lattice)

wordle.df <- read.csv("datasets/wordle.csv")

xtabs(Count ~ Guesses, data = wordle.df) |> 
  as.data.frame() |>
  barchart(Freq ~ Guesses, data = _, origin = 0, xlab = "Guesses", 
           ylab = "Counts", main = "Wordle guess distribution")

Figure: The Wordle guess distribution for David and friends

Simulating random samples

Sampling distribution of \(\hat{p}\)

Suppose the population proportion, \(p\), is known. It is the ground “truth” (parameter) that summarises all possible values we could observe

The sampling distribution of the sample proportion, \(\hat{p}\), is

\[ \hat{p} \overset{\text{approx.}}{\sim} \text{Normal} \! \left(\mu_{\hat{p}} = p, \sigma_{\hat{p}} = \sqrt{\frac{p\times(1-p)}{n}} \right) \]

The use of the \(\hat{p}\) subscripts is to make it clear that we are talking about the sampling distribution of \(\hat{p}\) and not the possible values we could observe
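
The Normal approximation above can be checked by simulation. The following is a minimal sketch rather than part of the original slides: the values of `p` and `n`, the seed, and the number of simulated samples are all assumptions chosen purely for illustration.

```r
# Hypothetical population proportion and sample size (assumed values)
p <- 0.3
n <- 200

set.seed(121)  # arbitrary seed so the simulation is reproducible

# Draw 10,000 random samples of size n and compute each sample proportion
p_hats <- rbinom(10000, size = n, prob = p) / n

# Compare the simulated centre and spread with the theoretical values
sim_mean  <- mean(p_hats)
sim_sd    <- sd(p_hats)
theory_sd <- sqrt(p * (1 - p) / n)

c(sim_mean = sim_mean, theory_mean = p, sim_sd = sim_sd, theory_sd = theory_sd)
```

If the approximation is reasonable, the simulated mean should sit close to `p` and the simulated standard deviation close to `theory_sd`. A histogram of `p_hats` should also look roughly bell-shaped.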

Assumptions for inference on p (for the method taught in DATAX121)

  1. Independent observations—typically met with random samples or randomisation of the data collection order with randomised experiments
  2. The following heuristic has to be met:
    At least ten “yes” values and at least ten “no” values in the sample
    \(n \times \hat{p} \geq 10\) and \(n \times (1 - \hat{p}) \geq 10\)
  3. For the hypothesis test for \(p\), the following heuristic also has to be met:
    At least ten “yes” values and at least ten “no” values in the sample according to the hypothesised value of the population (underlying) proportion
    \(n \times p_0 \geq 10\) and \(n \times (1 - p_0) \geq 10\)

More on 2. & 3.

These heuristics are a consequence of relying only on the sampling distribution of \(\hat{p}\)

Definition: \(\text{se}(\hat{p})\)

The standard error of the sample proportion, \(\hat{p}\), is

\[ \text{se}(\hat{p}) = \sqrt{\frac{\hat{p}\times(1-\hat{p})}{n}} \]

where:

  • \(\hat{p}\) is the sample proportion for the level of interest
  • \(n\) is the number of observations
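
The definition above translates directly into R. This is a small sketch: the helper name `se_p_hat` and the example counts are hypothetical, introduced only for illustration.

```r
# Standard error of a sample proportion
# x: number of observations in the level of interest
# n: number of observations
se_p_hat <- function(x, n) {
  p_hat <- x / n
  sqrt(p_hat * (1 - p_hat) / n)
}

# Hypothetical sample: 40 "yes" values out of 100 observations
se_example <- se_p_hat(x = 40, n = 100)
se_example
```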

Thought Question
If we hold \(n\) constant, what value of \(\hat{p}\) maximises its standard error?

This particular fact, alongside the \(z\)-multiplier (see Slide 12), is often used by polling companies to quantify the uncertainty of their data—and sometimes incorrectly
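
One way to explore the thought question numerically is to evaluate the standard error over a grid of sample proportions. This is a sketch only: the poll size `n = 1000` and the grid spacing are assumed values, not taken from the slides.

```r
n <- 1000  # hypothetical poll size (assumed)

# Grid of candidate sample proportions
p_hat <- seq(0.05, 0.95, by = 0.05)
se <- sqrt(p_hat * (1 - p_hat) / n)

# The sample proportion on the grid with the largest standard error
p_hat_at_max <- p_hat[which.max(se)]

# z-multiplier times the largest standard error:
# a "worst case" margin of error at 95% confidence
max_margin <- qnorm(0.975) * max(se)

c(p_hat_at_max = p_hat_at_max, max_margin = max_margin)
```

Quoting `max_margin` for every proportion in a poll is conservative, because any sample proportion away from the maximising value has a smaller standard error.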

CS 7.1: US Teens, Technology, and Friendships

A survey of 1060 randomly selected US teens ages 13 to 17 found that 605 of them say they have made a new friend online.

It was of interest to infer the population proportion of all US teens who have made a new friend online using this data. Furthermore, use this data to test whether more than 50% of all US teens have made a new friend online.

Made a new friend online?

# Data for proportions are commonly summarised as counts per group
data <- c(605, 455)
groups <- c("Yes", "No")

barchart(data ~ groups, origin = 0, 
         xlab = "Made a new friend online?", ylab = "Frequency",
         main = "Distribution of survey responses")

Figure: The 1060 survey responses to ‘Made a new friend online?’

CS 7.1: US Teens, Technology, and Friendships

Are all three assumptions met?

(Because we are also conducting a hypothesis test for CS 7.1)
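
The two heuristics can be checked with a few lines of R using the counts stated for CS 7.1. This is a sketch with variable names chosen for illustration. Independence cannot be checked in code; it rests on the random selection of the teens described in the case study.

```r
x  <- 605    # "Yes" responses
n  <- 1060   # number of observations
p0 <- 0.5    # hypothesised value tested in CS 7.1

p_hat <- x / n

# Heuristic for the confidence interval: observed "yes" and "no" counts
ci_counts <- c(yes = n * p_hat, no = n * (1 - p_hat))

# Heuristic for the hypothesis test: counts expected under the hypothesised value
test_counts <- c(yes = n * p0, no = n * (1 - p0))

ci_ok   <- all(ci_counts >= 10)
test_ok <- all(test_counts >= 10)

c(ci_ok = ci_ok, test_ok = test_ok)
```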

\(\text{se}(\hat{p}) = 0.0152028\), \(\text{se}(p_0) = 0.0153574\)

# Force R to print "Yes" first
groups.f <- factor(c("Yes", "No"), levels = c("Yes", "No"))

barchart(data ~ groups.f, origin = 0, 
         xlab = "Made a new friend online?", ylab = "Frequency",
         main = "Distribution of survey responses")

Figure: The 1060 survey responses to ‘Made a new friend online?’

A confidence interval for p

Definition: 100(1 - α)% confidence interval for p

\[ \hat{p} \pm z^*_{1-\alpha/2} \times \text{se}(\hat{p}) \]

where:

  • \(\hat{p}\) is the sample proportion for the level of interest
  • \(n\) is the number of observations
  • The confidence level is \((1 - \alpha)\), where \(\alpha\) is a proportion
  • \(z^*_{1-\alpha/2}\) is the z-multiplier for the prescribed confidence level of \((1 - \alpha)\)
  • \(\text{se}(\hat{p})\) is the standard error of \(\hat{p}\)—see Slide 8

Briefly: z-multiplier

Recall the following assumption for inference on \(p\)

The following heuristic has to be met: \(n \times \hat{p}\) and \(n \times (1 - \hat{p})\) are greater than or equal to 10

The theoretical justification for this arises from the fact that this method of constructing a confidence interval with a \(z\)-multiplier works “most” of the time without specifying a formal statistical model for the data

CS 7.1: US Teens, Technology, and Friendships

Recall that 605 out of the 1060 US teens said they made a new friend online. Construct a 95% confidence interval for the population proportion of all US teens who have made a new friend online.

\[ \hat{p} = \frac{605}{1060}, \quad \text{se}(\hat{p}) = 0.0152 \]

The solution is (0.5409577, 0.6005517)
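
The interval quoted above can be reproduced step by step from the definition. This is a minimal sketch with variable names chosen for illustration.

```r
x <- 605    # teens who said they made a new friend online
n <- 1060   # number of observations

p_hat  <- x / n
se_hat <- sqrt(p_hat * (1 - p_hat) / n)

# z-multiplier for a 95% confidence level (alpha = 0.05)
z_star <- qnorm(0.975)

# Lower and upper limits: p_hat -/+ z_star * se(p_hat)
ci <- p_hat + c(-1, 1) * z_star * se_hat
ci
```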

# The R function to find the z-multiplier
qnorm(0.975)
[1] 1.959964
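# The R function to calculate it in one go
# (prop.test() builds its interval by inverting the score test, so its limits
# differ slightly from the hand-calculated interval above)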
prop.test(x = 605, n = 1060, correct = FALSE)

    1-sample proportions test without continuity correction

data:  605 out of 1060, null probability 0.5
X-squared = 21.226, df = 1, p-value = 4.081e-06
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5407550 0.6002435
sample estimates:
        p 
0.5707547 

Interpretation of a confidence interval for p

For CS 7.1, the 95% confidence interval for the population proportion of all US teens who have made a new friend online was (0.5409577, 0.6005517)

A hypothesis test for p

The method covered in DATAX121 is known as a z-test for a proportion

An interesting nuance…

The \(\text{se}(\hat{p})\) is defined solely in terms of the observed statistic and the number of observations

Could we use the Normal approximation from Slide 6 directly for the hypothesis test? Yes, we can!

This is because we want to see if \(\hat{p}\) falls in an unusual region of the distribution we would expect to see if \(H_0\) is true

Therefore, the test statistic for \(p\) in DATAX121 instead uses the standard error of the hypothesised value of the population (underlying) proportion \(p_0\)

\[ \text{se}(p_0) = \sqrt{\frac{p_0\times(1-p_0)}{n}} \]

Definition: The test statistic for p

\[ z_0 = \frac{\hat{p} - p_0}{\text{se}(p_0)} \]

where:

  • \(z_0\) is the Z-test statistic (for p)
  • \(\hat{p}\) is the sample proportion
  • \(n\) is the number of observations
  • \(p_0\) is the hypothesised value of the population proportion
  • \(\text{se}(p_0)\) is the standard error of \(p_0\)—see Slide 17

Calculation of the p-value (for p)

Let \(Z\) follow the Standard Normal distribution

If it is a two-sided test, e.g. \(H_1 \! : p \neq p_0\)

\(\quad p\text{-value} = 2 \times \mathbb{P}(Z > |z_0|)\)

If it is a one-sided test and \(H_1 \! : p > p_0\)

\(\quad p\text{-value} = \mathbb{P}(Z > z_0)\)

If it is a one-sided test and \(H_1 \! : p < p_0\)

\(\quad p\text{-value} = \mathbb{P}(Z < z_0)\)
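
The three cases above map onto `pnorm()`, the Standard Normal cumulative distribution function in R. The helper `p_value_z` below is a hypothetical name for this sketch. Its `alternative` labels mirror those used by `prop.test()`, and the test statistic 1.96 is an assumed example value.

```r
# p-value for a z-test statistic under each form of alternative hypothesis
p_value_z <- function(z0, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    two.sided = 2 * pnorm(abs(z0), lower.tail = FALSE),  # 2 x P(Z > |z0|)
    greater   = pnorm(z0, lower.tail = FALSE),           # P(Z > z0)
    less      = pnorm(z0)                                # P(Z < z0)
  )
}

# Hypothetical test statistic
p_two  <- p_value_z(1.96, "two.sided")
p_high <- p_value_z(1.96, "greater")
p_low  <- p_value_z(1.96, "less")

c(two.sided = p_two, greater = p_high, less = p_low)
```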

CS 7.1: US Teens, Technology, and Friendships

Recall that 605 out of the 1060 US teens said they made a new friend online. Use this data to test whether more than 50% of all US teens have made a new friend online.

\[ \hat{p} = \frac{605}{1060}, \quad \text{se}(\hat{p}) = 0.0152 \]

Hypothesis statements
\(\quad H_0\!: p = 0.5\)
\(\quad H_1\!: p > 0.5\)

Lastly, we need \(\text{se}(p_0)\)

\(\text{se}(p_0) = 0.0153574\)
\(z_0 = 4.6072134\)
\(z_0^2 = 21.2264151\), which is the X-squared value reported by prop.test()
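
The quantities needed for this one-sided test can be computed by hand in R. This is a sketch with variable names chosen for illustration. Squaring the test statistic gives the X-squared value that `prop.test()` reports.

```r
x  <- 605    # teens who said they made a new friend online
n  <- 1060   # number of observations
p0 <- 0.5    # hypothesised value of the population proportion

p_hat <- x / n

# Standard error under the hypothesised value
se_0 <- sqrt(p0 * (1 - p0) / n)

# Test statistic
z_0 <- (p_hat - p0) / se_0

# One-sided p-value for H1: p > p0
p_value <- pnorm(z_0, lower.tail = FALSE)

c(se_0 = se_0, z_0 = z_0, z_0_squared = z_0^2, p_value = p_value)
```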

# The R function to calculate it in one go
prop.test(x = 605, n = 1060, p = 0.5, 
          alternative = "greater", correct = FALSE)

    1-sample proportions test without continuity correction

data:  605 out of 1060, null probability 0.5
X-squared = 21.226, df = 1, p-value = 2.041e-06
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.5455993 1.0000000
sample estimates:
        p 
0.5707547 

Interpretation of a hypothesis test for p

For CS 7.1, the exact p-value for the appropriate set of hypothesis statements was \(2.0405 \times 10^{-6}\)

CS 7.2: A very old telephone poll

Hull, J. D. (1994, December 26). Tale of One Parish. Time, 144(26), 74–76.

CS 7.2: A very old telephone poll

Time magazine reported that in a 1994 survey of 507 randomly selected adult American Catholics, 59% answered yes to the question “Do you favour allowing women to be priests?”

Does this data indicate that the majority of all adult American Catholics are in favour?

Variables
Answer: A factor denoting whether a survey respondent answered either yes or no to the question “Do you favour allowing women to be priests?”
times.df <- read.csv("datasets/times-poll.csv")

# Tell R to order the responses as "Yes" then "No" when summarising
times.df$Answer <- factor(times.df$Answer, levels = c("Yes", "No"))

# Summarise the data in terms of the sample proportions
xtabs( ~ Answer, data = times.df) |>
  proportions()
Answer
      Yes        No 
0.5897436 0.4102564 

CS 7.2: Checking assumptions

Independence

It has been met as the survey randomly selected adult American Catholics

Heuristic 1

It is quite clear from the bar plot that \(n \times \hat{p} \geq 10\) and \(n \times (1 - \hat{p}) \geq 10\). So we can construct a confidence interval for \(p\)

Heuristic 2

If we test \(p_0 = 0.5\), then \(n \times p_0 \geq 10\) and \(n \times (1 - p_0) \geq 10\) are true statements. So we can conduct a hypothesis test for \(p\) as well

# Review T01: Exploring Data
xtabs(~ Answer, data = times.df) |>
  as.data.frame() |>
  barchart(Freq ~ Answer, data = _, origin = 0,
    main = "Distribution of survey responses",
    xlab = "Do you favour allowing women to be priests?",
    ylab = "Frequency")

Figure: The 507 survey responses to ‘Do you favour allowing women to be priests?’

CS 7.2: Using R & a data file for the analysis

# We can make use of R's pipe operator to "forward" the one-way table of counts
# that summarises the data file
xtabs(~ Answer, data = times.df) |>
  prop.test(correct = FALSE, p = 0.5)

    1-sample proportions test without continuity correction

data:  xtabs(~Answer, data = times.df), null probability 0.5
X-squared = 16.333, df = 1, p-value = 5.312e-05
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5464089 0.6317285
sample estimates:
        p 
0.5897436 
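
The same analysis can be sketched without the data file by passing counts to `prop.test()`. The counts below are an assumption: 299 "Yes" responses out of 507 is inferred from the sample proportion 0.5897436 shown in the output, not read from the data file itself.

```r
yes <- 299   # inferred number of "Yes" responses (assumption, see above)
n   <- 507   # number of survey respondents
p0  <- 0.5   # hypothesised value of the population proportion

# Two-sided test and 95% confidence interval from counts
result <- prop.test(x = yes, n = n, p = p0, correct = FALSE)

# The same test by hand: z-test statistic and two-sided p-value
z_0     <- (yes / n - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- 2 * pnorm(abs(z_0), lower.tail = FALSE)

c(X_squared = unname(result$statistic), z_0_squared = z_0^2,
  prop_test_p = result$p.value, by_hand_p = p_value)
```

The squared z-test statistic equals the X-squared value, and the two p-values agree, so the `prop.test()` output here is the two-sided z-test for a proportion.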

CS 7.2: Interpretation of the output

95% CI for \(p\)
With 95% confidence, we estimate that the underlying proportion of all adult American Catholics who were in favour of allowing women to be priests is somewhere between 54.6 and 63.2 percent

Hypothesis Test for \(p = 0.5\)
At the 5% level of significance, we reject the null that the underlying proportion of all adult American Catholics who were in favour of allowing women to be priests is 50 percent, in favour of the alternative that it is not
(p-value = 5.312e-05)
